**Scheduling Jobs** ------------------------- **Anatomy of a job** ====================== Jobs are made of one or multiple **sequential steps**, each consisting in one or multiple **parallel tasks** that will be dispatched to possibly distinct nodes. Each task will be allocated CPUs, memory, and possible other generic resources in an exclusive manner by Slurm. Two jobs cannot share the same resource unless explicitly forced by the admins, but that is generally not the case. Therefore, jobs can only start when all needed resources are free and not needed by another higher priority job. Jobs are indeed assigned a **priority** when they are submitted, which can depend upon multiple factors. Each job has two parts: - Resource requests: describe the amount of computing resource (CPUs, memory, expected run time, etc.) that the job will need to successfully run. - Job steps: describe individual tasks that must be executed into a job. You can execute a job step with the SLURM command srun. A job can has one or more steps, each consisting in one or more tasks each using one or more CPU, GPU, etc. **Job Types** ================ There are a two popular types of jobs you could submit to the HPC cluster: - `batch <#scheduling-a-batch-job>`__ which are unattended shell scripts - `interactive <#interactive-jobs>`__ where you run and test out your workflows live interactively on the command line, **Starting a SLURM Job** =============================== There are two ways of starting jobs with SLURM; either interactively with ``srun`` or as a script with ``sbatch`` commands. **Defining and submitting A Batch job** ========================================== For the scheduling process to work properly, you will need to describe your job before you submit it: - what are the steps (i.e. which program must be run and how) ; - how many tasks there will be ; - what resource each task needs (CPU, memory, etc.), and - for how long the job is supposed to run. All of these, along with potentially aditionnal job parameter submission options, can be described in a **submission script**. A submission script is a `shell script `__, e.g. a `Bash `__ script, whose comments, if they are prefixed with #SBATCH, are understood by Slurm as directives describing resource requests and other submissions options. You can get the complete list of parameters from the sbatch manpage man sbatch. .. Important:: The #SBATCH directives must appear at the top of the submission file, before any other line except for the very first line which should be the `shebang `__ (e.g. ``#!/bin/bash`` ). The script itself is a job step. Other job steps are created with the ``srun`` command, that takes as argument the name and options of the program to run. .. Note:: If there is only one step, the srun command can be omitted, but it has consequences in how signals are handled and how accurate reporting is. It is often best to use it in all cases. Below is a slurm script template to Submit a batch job from the login node by calling sbatch .slurm. .. code-block:: python $cat script.slurm #!/bin/bash #SBATCH --partition=debug # partition name. Eg. Debug, bio, bigmem #SBATCH --job-name=demosample # job name #SBATCH --nodes=2 # number of nodes allocated for this job #SBATCH --ntasks=2 # total number of tasks / mpi processes #SBATCH --cpus-per-task=8 # number OpenMP Threads per process #SBATCH --time=08:00:00 # total run time limit ([[D]D-]HH:MM:SS) ##SBATCH --gres=gpu:tesla:2 # number of GPUs ##The above line and this will be ignored by Slurm # Get email notification when job begins, finishes or fails #SBATCH --mail-type=ALL # type of notification: BEGIN, END, FAIL, ALL #SBATCH --mail-user=your@mail # e-mail address SBATCH --chdir= #SBATCH --export=all #SBATCH --output= # where STDOUT goes #SBATCH --error= # where STDERR goes # Modules to use (optional). # # Run the application. # echo [`date '+%Y-%m-%d %H:%M:%S'`] Running $AE_ARCH srun hostname sleep 60 For instance, the following batch script, hypothetically named submit.slurm, .. code-block:: python #!/bin/bash # #SBATCH --job-name=testrun #SBATCH --output=result.txt #SBATCH --partition=debug # #SBATCH --time=5:00 #SBATCH --ntasks=1 #SBATCH --cpus-per-task=1 #SBATCH --mem-per-cpu=200 srun hostname srun sleep 30 describes a job made of 2+1 steps (the submission script itself plus two explicit steps created by calling srun twice), each step consisting of only one task that needs one CPU and 200MB of RAM. The first step will run the ``hostname`` command, and the second one the useless ``sleep`` command. The job is supposed to run for *5 minutes* on the debug partition, be named testrun, and create an output file named *result.txt*. .. Important:: It is important to note that as the job will run unattended, it will not be attached to a terminal (screen) so everything that the program that is run would like to write to the terminal will be redirected by Slurm to a file, whose name is specified by the --output parameter. (Note that for the same reasons, the program cannot expect input from the user through the keyboard to run properly.) Once the submission script is written properly, you need to **submit** it to slurm through the ``sbatch`` command, which, upon success, responds with the *jobid* attributed to the job. (The dollar sign below is the `shell prompt `__) .. code-block:: python $ sbatch submit.slurm sbatch: Submitted batch job 12321 .. Warning:: Make sure to submit the job with sbatch and not bash; also do not execute it directly. This would ignore all resource request and your job would run with minimal resources, or could possible run on the login node rather than on a compute node. The job then enters the Job queue in the *PENDING* state which be verified with; .. code-block:: python $ squeue --me Once resources become available and the job has highest priority, an **allocation** is created for it and switches to the RUNNING state. If the job completes correctly, it goes to the *COMPLETED* state, otherwise, it is set to the *FAILED* state. **Slurm Arguments** ====================== These are the common and recommended arguments suggested at a minimum to get a job in any form. +-------+-------+-----------------------------------------------------+ | **A | **Co | **Notes** | | rgume | mmand | | | nts** | Fl | | | | ags** | | +=======+=======+=====================================================+ | Ac | -A or | What lab are you part of? If you run | | count |  --ac | the groups command you can see what groups (usually | | | count | labs) you're a member of, these are associated with | | | | resource limits on the cluster. | +-------+-------+-----------------------------------------------------+ | Part | -p | What resource partition are you interested in | | ition |  or - | using? This could be anything you see when you | | | -part | run sinfo -s as each partition corresponds to a | | | ition | class of nodes (e.g., high memory, GPU). | +-------+-------+-----------------------------------------------------+ | Nodes | -N  | How many nodes are these resources spread across? | | | or -- | In the overwhelming number of cases this is 1 (for | | | nodes | a single node) but more sophisticated multi-node | | | | jobs could be run if your code supports it. | +-------+-------+-----------------------------------------------------+ | Cores | - | How many compute cores do you need? Not all codes | | | c or  | can make use of multiple cores and if they do, the | | | --cpu | performance of the code is not always linear with | | | s-per | the resources requested. If in doubt consider | | | -task | contacting the research computing team to assist in | | | | this optimization. | +-------+-------+-----------------------------------------------------+ | M | --mem | How much memory do you need for this job? This is | | emory | | in the format size[units] were size is a number and | | | | units are either M, G, or T for megabyte, gigabyte, | | | | and terabyte respectively. Megabyte is the default | | | | unit if none is provided. | +-------+-------+-----------------------------------------------------+ | Time | -t | What's the maximum runtime for this job? Common | | |  or - | acceptable time formats | | | -time | include hours:minutes:seconds, days-hours, | | | | and minutes. | +-------+-------+-----------------------------------------------------+ **Slurm Environment Variables** ================================ When a job scheduled by Slurm begins, it needs to about how it was scheduled, what its working directory is, who submitted the job, the number of nodes and cores allocated to it, etc. This information is passed to Slurm via environment variables. Additionally, these environment variables are also used as default values by programs like mpirun. To view a node's Slurm environment variables, use export \| grep SLURM. A comprehensive list of the environment variables Slurm sets for each job can be found at the end of the *sbatch man page*. **Interactive jobs** ========================== Slurm jobs are normally batch jobs in the sense that they are run unattended. If you want to have a direct view on your job, for tests or debugging, you have two options. If you need simply to have an interactive Bash session on a compute node, with the same environment set as the batch jobs, run the following command: .. code-block:: python srun --pty bash -l Doing that, you are submitting a 1-CPU, default memory, default duration job that will return a Bash prompt when it starts. If you need more flexibility, you will need to use the `salloc `__ command. The salloc command accepts the same parameters as sbatch as far as resource requirement are concerned. Once the allocation is granted, you can use srun the same way you would in a submission script. **Starting an interactive job** You can run an interactive job like this: .. code-block:: python $ srun --nodes=1 --ntasks-per-node=1 --time=01:00:00 --pty bash -i Here we ask for a single core on one interactive node for one hour with the default amount of memory. The command prompt will appear as soon as the job starts. This is how it looks once the interactive job starts: .. code-block:: python srun: job 12345 queued **and** waiting **for** resources srun: job 12345 has been allocated resources Exit the bash shell to end the job. If you exceed the time or memory limits the job will also abort. Interactive jobs have the same policies as normal batch jobs, there are no extra restrictions. You should be aware that you might be sharing the node with other users, so play nice. Some users have experienced problems with the command, then it has helped to specify the cpu account: .. code-block:: python $ srun --account= --nodes=1 --ntasks-per-node=1 --time=01:00:00 --pty bash -i **Interactive Jobs (Single Node)** Resources for interactive jobs are attained either using ``salloc``. To request a compute node from the **bio partition** (biol) interactively consider the example below. *# Below replace the word account with an account name you belong to* *# Use sinfo to see your accounts and partitions* .. code-block:: python salloc -A account -p bio -N 1 -c 4 --mem=12G --time=2:45:00 You are asking slurm for 4 compute cores with 12GB of memory for 2 hours and 45 minutes spread across 1 node (single machine). The ``salloc`` command will automatically create an interactive shell session on an allocated node. **Interactive Jobs (Multi Node)** Building upon the previous section, if -N or --nodes is >1 when running ``salloc`` you are automatically placed into a shell of one of the allocated nodes. This shell is NOT part of a Slurm task. To view the names of the remainder of your allocated nodes use ``scontrol show hostnames``. The ``srun`` command can be used to execute a command on all of the allocated nodes as shown in the example session below. .. code-block:: python [user@allot ~]$ salloc -N 2 -p full -A stf --time=5 --mem=5G salloc: Pending job allocation 2620960 salloc: job 2620960 queued and waiting for resources salloc: job 2620960 has been allocated resources salloc: Granted job allocation 2620960 salloc: Waiting for resource configuration salloc: Nodes giga[002-003] are ready for job [user@allot ~]$ srun hostname giga002 giga003 .. note:: - If you are not allocated a session with the specified --mem value, try smaller memory values For more details, read the *salloc man page*. **Keeping interactive jobs alive** Interactive jobs die when you disconnect from the login node either by choice or by internet connection problems. To keep a job alive you can use a terminal multiplexer like tmux or screen. tmux allows you to run processes as usual in your standard bash shell You start tmux on the login node before you get a interactive slurm session with srun and then do all the work in it. In case of a disconnect you simply reconnect to the login node and attach to the tmux session again by typing: .. code-block:: python tmux attach or in case you have multiple sessions running: .. code-block:: python tmux list-session tmux attach -t SESSION_NUMBER As long as the tmux session is not closed or terminated (e.g. by a server restart) your session should continue. To log out a tmux session without closing it you have to press CTRL-B (that the Ctrl key and simultaneously “b”, which is the standard tmux prefix) and then “d” (without the quotation marks). To close a session just close the bash session with either CTRL-D or type exit. You can get a list of all tmux commands by CTRL-B and the ? (question mark). See also `this page `__ for a short tutorial of tmux. Otherwise working inside of a tmux session is almost the same as a normal bash session. **Job related environment variables** Here we list some environment variables that are defined when you run a job script. These is not a complete list. Please consult the `SLURM documentation `_ for a complete list. +--------------------------+--------------------------------------------------------------------------+ | Variable | Description | +==========================+==========================================================================+ | $SLURM_JOB_ID | The Job ID. | +--------------------------+--------------------------------------------------------------------------+ | $SLURM_JOBID | Deprecated. Same as $SLURM_JOB_ID | +--------------------------+--------------------------------------------------------------------------+ | $SLURM_SUBMIT_DIR | The path of the job submission directory. | +--------------------------+--------------------------------------------------------------------------+ | $SLURM_SUBMIT_HOST | The hostname of the node used for job submission. | +--------------------------+--------------------------------------------------------------------------+ | $SLURM_JOB_NODELIST | Contains the definition (list) of the nodes that is assigned to the job. | +--------------------------+--------------------------------------------------------------------------+ | $SLURM_NODELIST | Deprecated. Same as SLURM_JOB_NODELIST. | +--------------------------+--------------------------------------------------------------------------+ | $SLURM_CPUS_PER_TASK | Number of CPUs per task. | +--------------------------+--------------------------------------------------------------------------+ | $SLURM_CPUS_ON_NODE | Number of CPUs on the allocated node. | +--------------------------+--------------------------------------------------------------------------+ | $SLURM_JOB_CPUS_PER_NODE | Count of processors available to the job on this node. | +--------------------------+--------------------------------------------------------------------------+ | $SLURM_CPUS_PER_GPU | Number of CPUs requested per allocated GPU. | +--------------------------+--------------------------------------------------------------------------+ | $SLURM_MEM_PER_CPU | Memory per CPU. Same as --mem-per-cpu . | +--------------------------+--------------------------------------------------------------------------+ | $SLURM_MEM_PER_GPU | Memory per GPU. | +--------------------------+--------------------------------------------------------------------------+ | $SLURM_MEM_PER_NODE | Memory per node. Same as --mem . | +--------------------------+--------------------------------------------------------------------------+ | $SLURM_GPUS | Number of GPUs requested. | +--------------------------+--------------------------------------------------------------------------+ | $SLURM_NTASKS | Same as -n, –ntasks. The number of tasks. | +--------------------------+--------------------------------------------------------------------------+ | $SLURM_NTASKS_PER_NODE | Number of tasks requested per node. | +--------------------------+--------------------------------------------------------------------------+ | $SLURM_NTASKS_PER_SOCKET | Number of tasks requested per socket. | +--------------------------+--------------------------------------------------------------------------+ | $SLURM_NTASKS_PER_CORE | Number of tasks requested per core. | +--------------------------+--------------------------------------------------------------------------+ | $SLURM_NTASKS_PER_GPU | Number of tasks requested per GPU. | +--------------------------+--------------------------------------------------------------------------+ | $SLURM_NPROCS | Same as -n, --ntasks. See $SLURM_NTASKS. | +--------------------------+--------------------------------------------------------------------------+ | $SLURM_NNODES | Total number of nodes in the job’s resource allocation. | +--------------------------+--------------------------------------------------------------------------+ | $SLURM_TASKS_PER_NODE | Number of tasks to be initiated on each node. | +--------------------------+--------------------------------------------------------------------------+ | $SLURM_ARRAY_JOB_ID | Job array’s master job ID number. | +--------------------------+--------------------------------------------------------------------------+ | $SLURM_ARRAY_TASK_ID | Job array ID (index) number. | +--------------------------+--------------------------------------------------------------------------+ | $SLURM_ARRAY_TASK_COUNT | Total number of tasks in a job array. | +--------------------------+--------------------------------------------------------------------------+ | $SLURM_ARRAY_TASK_MAX | Job array’s maximum ID (index) number. | +--------------------------+--------------------------------------------------------------------------+ | $SLURM_ARRAY_TASK_MIN | Job array’s minimum ID (index) number. | +--------------------------+--------------------------------------------------------------------------+